{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 字符串\n",
    "\n",
    "对于单个字符的编码，Python提供了`ord()`函数获取字符的整数表示，`chr()`函数把编码转换为对应的字符："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "65\n",
      "20013\n",
      "B\n",
      "中\n"
     ]
    }
   ],
   "source": [
    "print(ord(\"A\"))\n",
    "print(ord(\"中\"))\n",
    "print(chr(66))\n",
    "print(chr(20013))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果知道字符的整数编码，还可以用十六进制这么写`str`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "中文\n"
     ]
    }
   ],
   "source": [
    "print('\\u4e2d\\u6587')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "两种写法完全是等价的。\n",
    "\n",
    "由于Python的字符串类型是`str`，在内存中以Unicode表示，一个字符对应若干个字节。如果要在网络上传输，或者保存到磁盘上，就需要把`str`变为以字节为单位的`bytes`。\n",
    "\n",
    "Python对`bytes`类型的数据用带b前缀的单引号或双引号表示：\n",
    "\n",
    "```\n",
    "c = b'ABC'\n",
    "```\n",
    "\n",
    "要注意区分'ABC'和b'ABC'，前者是`str`，后者虽然内容显示得和前者一样，但`bytes`的每个字符都只占用一个字节。\n",
    "\n",
    "以Unicode表示的`str`通过`encode()`方法可以编码为指定的`bytes`，例如：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'ABC'\n",
      "b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'\n"
     ]
    }
   ],
   "source": [
    "print('ABC'.encode('ASCII'))\n",
    "print('中文'.encode('utf-8'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "invalid syntax (<ipython-input-10-5d258497d09b>, line 1)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-10-5d258497d09b>\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m    print('中文'encode('ASCII'))\u001b[0m\n\u001b[0m                   ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
     ]
    }
   ],
   "source": [
    "print('中文'encode('ASCII'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "纯英文的`str`可以用ASCII编码为`bytes`，内容是一样的，含有中文的`str`可以用UTF-8编码为`bytes`。含有中文的`str`无法用ASCII编码，因为中文编码的范围超过了ASCII编码的范围，Python会报错。\n",
    "\n",
    "在`bytes`中，无法显示为ASCII字符的字节，用\\x##显示。\n",
    "\n",
    "反过来，如果我们从网络或磁盘上读取了字节流，那么读到的数据就是`bytes`。要把`bytes`变为`str`，就需要用`decode()`方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b'ABC'.decode('ASCII'))\n",
    "print(b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果bytes中包含无法解码的字节，decode()方法会报错："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b'\\xe4\\xb8\\xad\\xff'.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果bytes中只有一小部分无效的字节，可以传入errors='ignore'忽略错误的字节："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b'\\xe4\\xb8\\xad\\xff'.decode('utf-8', errors = 'ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要计算`str`包含多少个字符，可以用`len()`函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len('ABC'))\n",
    "print(len('中文'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "len()函数计算的是str的字符数，如果换成bytes，len()函数就计算字节数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(b'ABC'))\n",
    "print(len(b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'))\n",
    "print(len('中文'.encode('utf-8')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可见，1个中文字符经过UTF-8编码后通常会占用3个字节，而1个英文字符只占用1个字节。\n",
    "\n",
    "在操作字符串时，我们经常遇到str和bytes的互相转换。为了避免乱码问题，应当始终坚持使用UTF-8编码对str和bytes进行转换。\n",
    "\n",
    "由于Python源代码也是一个文本文件，所以，当你的源代码中包含中文的时候，在保存源代码时，就需要务必指定保存为UTF-8编码。当Python解释器读取源代码时，为了让它按UTF-8编码读取，我们通常在文件开头写上这两行：\n",
    "\n",
    "```\n",
    "#!/usr/bin/env python3\n",
    "# -*- coding: utf-8 -*-\n",
    "```\n",
    "\n",
    "第一行注释是为了告诉Linux/OS X系统，这是一个Python可执行程序，Windows系统会忽略这个注释；\n",
    "\n",
    "第二行注释是为了告诉Python解释器，按照UTF-8编码读取源代码，否则，你在源代码中写的中文输出可能会有乱码。\n",
    "\n",
    "申明了UTF-8编码并不意味着你的.py文件就是UTF-8编码的，必须并且要确保文本编辑器正在使用UTF-8 without BOM编码："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 格式化字符串\n",
    "\n",
    "在Python中，采用的格式化方式和C语言是一致的，用%实现，举例如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Hello %s' % 'world')\n",
    "print('Hello %s %d' % ('world', 1000))\n",
    "print('Hi, %s you have $%d.' % ('Micro', 12345))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你可能猜到了，%运算符就是用来格式化字符串的。在字符串内部，%s表示用字符串替换，%d表示用整数替换，有几个%?占位符，后面就跟几个变量或者值，顺序要对应好。如果只有一个%?，括号可以省略。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "常见的占位符有：\n",
    "\n",
    "占位符|替换内容\n",
    ":---:|:---:\n",
    "%d|整数\n",
    "%f|浮点数\n",
    "%s|字符串\n",
    "%x|十六进制整数\n",
    "\n",
    "其中，格式化整数和浮点数还可以指定是否补0和整数与小数的位数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('%2d-%02d' % (3, 1))\n",
    "print('%.2f' % 3.1415926) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果你不太确定应该用什么，%s永远起作用，它会把任何数据类型转换为字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Age:%s, Gender:%s' % (29, True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有些时候，字符串里面的%是一个普通字符怎么办？这个时候就需要转义，用%%来表示一个%："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('rate: %d%%' % 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## format()\n",
    "\n",
    "另一种格式化字符串的方法是使用字符串的format()方法，它会用传入的参数依次替换字符串内的占位符{0}、{1}……，不过这种方式写起来比%要麻烦得多："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Hello, {0}, 成绩提升了{1:.1f}%'.format('小明', 17.125))\n",
    "print('Hello, {0}, 成绩提升了{1:.1f}%'.format('小黑', 17.155))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
